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Abstract 

Background: Data on causes of death by age and sex are a critical input into health decision-making. Priority 
setting in public health should be informed not only by the current magnitude of health problems but by trends 
in them. However, cause of death data are often not available or are subject to substantial problems of 
comparability. We propose five general principles for cause of death model development, validation, and reporting. 

Methods: We detail a specific implementation of these principles that is embodied in an analytical tool - the 
Cause of Death Ensemble model (CODEm) - which explores a large variety of possible models to estimate trends 
in causes of death. Possible models are identified using a covariate selection algorithm that yields many plausible 
combinations of covariates, which are then run through four model classes. The model classes include mixed 
effects linear models and spatial-temporal Gaussian Process Regression models for cause fractions and death rates. 
All models for each cause of death are then assessed using out-of-sample predictive validity and combined into an 
ensemble with optimal out-of-sample predictive performance. 

Results: Ensemble models for cause of death estimation outperform any single component model in tests of root 
mean square error, frequency of predicting correct temporal trends, and achieving 95% coverage of the prediction 
interval. We present detailed results for CODEm applied to maternal mortality and summary results for several 
other causes of death, including cardiovascular disease and several cancers. 

Conclusions: CODEm produces better estimates of cause of death trends than previous methods and is less 
susceptible to bias in model specification. We demonstrate the utility of CODEm for the estimation of several major 
causes of death. 

Keywords: cause of death, ensemble models, predictive validity, spatial-temporal models, maternal mortality, Glo- 
bal Burden of Disease 



Background 

Data on causes of death by age and sex are a critical 
input into health decision-making. Nations devote con- 
siderable resources to collecting, collating, and analyzing 
various types of cause of death data for this reason 
[1-3]. Priority setting in public health, however, should 
be informed not only by the current magnitude of 
health problems but by trends in them. Whether or not 
a cause of death is increasing or decreasing is important 
information as to whether current disease control efforts 
are working or inadequate. The rising burden of 
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diabetes and the policy debate it has triggered is a good 
example of the importance of monitoring national 
trends in causes of death [4,5]. 

The fundamental challenge for most countries, how- 
ever, is that cause of death data are often not available 
or subject to substantial problems of comparability. 
Even in the 89 countries with complete vital registration 
systems and medical certification of causes of death in 
2009, many issues of comparability remain [6-8]. Dra- 
matic changes from year to year in death rates from a 
cause can be due to changes in International Classifica- 
tion of Diseases (ICD) revision [9,10] or national modifi- 
cations of coding rules [11-14]. In some cases, causes 
such as HIV or diabetes may be systematically misclassi- 
fied [8,15-21]. The fraction of deaths assigned to causes 
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that are not true underlying causes of death can vary 
widely and change over time [20,22-27]. In places with- 
out complete vital registration, a range of sources such 
as verbal autopsy studies (national or subnational), par- 
tial urban vital registration, or survey/census data may 
be available. Data may be available only for a limited 
number of years and these data are often subject to sub- 
stantial sampling and even larger nonsampling error. 
Generating national assessments of causes of death by 
age, sex, and year requires a strategy and methodology 
to deal with this diverse set of data issues. 

Efforts to model causes of death using available data 
have a long history [28-34]. Initial attempts focused on 
estimating causes of death for a cross-section of coun- 
tries by modeling cause as a function of overall mortal- 
ity levels. For example, Preston's 1976 Mortality 
Patterns in National Populations: With Special Reference 
to Recorded Causes of Death [28] was the initial effort 
to assess trends in causes of death taking into account 
misclassification of deaths. The demand for estimates of 
both levels and trends in key causes from diverse groups 
have led to multiple recent studies on diarrhea, maternal 
mortality, and other causes of death [29-31,35-37]. 
These studies have used a wide variety of analytical stra- 
tegies and specific model implementations. The recent 
debate on maternal mortality estimation [35,36,38-41] is 
an illustration of quite different choices of the depen- 
dent variable and model specifications. The use of Gaus- 
sian Process Regression (GPR) and other related 
techniques has been used for all-cause mortality in chil- 
dren and adults and in time series cross-sectional work 
on key risk factors [5,42-45]. Affordable computational 
power and innovations in Bayesian statistical modeling 
have fueled a steady growth in alternative estimation 
strategies. This innovation is likely to continue for the 
foreseeable future. 

Comparing alternative modeling approaches applied to 
the same cause of death is complicated by a lack of 
accepted standards for good cause of death modeling 
practice. Preferences for the results of alternative strate- 
gies may be based not on documented performance but 
on impressionistic grounds. In this paper, we propose 
five general principles for cause of death model develop- 
ment, validation, and reporting. We then detail a speci- 
fic implementation of these principles that is embodied 
in an analytical tool, CODEm - the Cause of Death 
Ensemble model. 

Principles for cause of death model development 
1. Identify all the available data 

Good cause of death modeling practice begins with a 
systematic attempt to identify all the available data. 
Most cause of death data is captured through a variety 
of national data collection systems such as partial or 



complete vital registration or national or sample regis- 
tration systems with verbal autopsy. Most of these data 
are not published in the scientific literature but are 
available through national sources or the World Health 
Organization (in the case of vital registration with medi- 
cal certification of causes of death). These main sources 
can also be supplemented with subnational studies on 
select causes or age groups from the published literature 
through systematic reviews. For some diseases, there 
may be special sources of information, such as popula- 
tion-based cancer registry data for mortality from 
selected cancers in particular catchment areas. 
2. Maximize the comparability and quality of the dataset 
After all the available data have been identified, several 
common challenges for the comparability and quality of 
cause of death data need to be addressed, including 
mapping across various revisions of the ICD, variation 
in garbage coding across countries and time, misclassifi- 
cation due to poor diagnostic capacity, comparability of 
alternative verbal autopsy methods, completeness of 
cause of death registration, and large nonsampling var- 
iance. There is an extensive literature on the mapping 
for different causes across revisions of the ICD [13,46]; 
the challenge is greater for certain specific causes of 
death. A second important source of known bias is the 
assignment of a substantial fraction of deaths to causes 
of death that are not underlying causes of death, often 
called "garbage codes"[27, 47-49]. Preston in 1976 
already noted that trends in cardiovascular disease over 
time were profoundly different if garbage codes were 
taken into account [28,36]. The Global Burden of Dis- 
eases, Injuries, and Risk Factors (GBD) 1990 Study 
introduced simple algorithms for redistributing deaths 
from major garbage codes [34], and these were refined 
for the GBD 2000 Study work [50]. More detailed algo- 
rithms driven by a more detailed examination of disease 
pathology have since been proposed [1,51]. Special 
methods have been proposed for selected causes, such 
as HIV in populations where the cause is often misclas- 
sified due to stigma or other factors. For example, Birn- 
baum et al. found that many HIV deaths in South 
Africa had been classified to other causes including 
tuberculosis, pneumonias, and other infectious diseases 
[52]. The substantial difference in the strategy for cor- 
recting misclassification in two recent studies on mater- 
nal mortality illustrates the spectrum of approaches in 
use [36,39,40]. Uncertainty in the correction for known 
bias should, in principle, be propagated into the uncer- 
tainty in the results. Methods for quantifying this uncer- 
tainty, however, have not yet been developed. A third 
critical factor in enhancing comparability and quality is 
to correct for the fact that in some vital event registra- 
tion systems, not all deaths are captured. Death rates 
based on these systems need to be corrected for the 
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completeness of death registration. Various methods 
have been proposed to correct for completeness [34,53]. 

Even after various known biases due to ICD classifica- 
tion changes, garbage coding, and completeness have 
been taken into account, the results of some studies 
may be subject to large nonsampling error. Nonsam- 
pling error can come about from a wide range of factors 
that can affect vital registration, verbal autopsy, and 
other surveillance data sources. Explicit criteria should 
be used to detect extreme outliers in the rates or cause 
fractions derived from different sources. Based on expli- 
cit criteria, outlier data points should be excluded from 
the cause of death modeling analysis. 
3. Develop a diverse set of plausible models 
There are a myriad of choices to be made when model- 
ing cause of death trends. Cause of death data are often 
sparse for many low-income countries and may also be 
more available for certain time periods such as 1995 to 
2005 than for earlier or later periods; predictions for 
data-sparse time periods and/or populations are often 
sensitive to subtle differences in model specification or 
the form of the dependent variable; modeling cause-spe- 
cific mortality rates or cause-fractions can yield very dif- 
ferent results; and choosing to use educational 
attainment rather than income per capita can lead to 
major differences in the predicted trend in causes of 
death in places with economic downturns. We believe 
that good modeling practice starts by casting a wide net 
in terms of proposed models. 

Whatever models that are proposed, however, should 
meet basic plausibility criteria. Known strong biological 
or behavioral relationships should be respected. For 
example, models for the age-specific stroke death rate 
should have a positive coefficient on mean systolic 
blood pressure of the population, models for lung can- 
cer should have a positive coefficient on tobacco con- 
sumption, and so on. A set of models for which these 
known directions of relationships are respected provides 
a more robust platform for assessing models and creat- 
ing model ensembles. 

The diverse set of models we test also includes various 
ways of combining the predictions of multiple individual 
models. Experience in many fields - ranging from 
meteorology, soil chemistry, and stocks to the Netflix 
Challenge and others - have demonstrated that ensem- 
ble models have smaller prediction error than the best 
single model [54-61]. Ensemble models that are 
weighted combinations of the posterior distributions of 
component models provide lower error for the point 
estimate and more accurate uncertainty intervals 
[62-66]. Additionally, ensemble models capture uncer- 
tainty not only due to the parameters in any one model 
but also the uncertainty of predictions due to differences 
in specification across models. Many methods have been 



proposed and used for developing weights for compo- 
nent models in an ensemble, including Bayesian Model 
Averaging (BMA), averaging of all plausible models, and 
using fixed or arbitrary weights [67]. 
4. Assess the predictive validity of each plausible individual 
model and of ensemble models 

In-sample fit is not a robust measure of prediction when 
data are sparse or missing. Instead, out-of-sample pre- 
dictive validity is the appropriate measure [66,68-72]. If 
one model generates better out-of-sample predictions 
than another, we should prefer this model for the task 
of prediction. Assessing predictive validity requires hold- 
ing out some fraction of the data from model building 
and then assessing the predictions from the model 
against the data that have been held out. The compel- 
ling logic that modeling strategies used for prediction 
should be assessed through out-of-sample predictive 
validity is unlikely to be controversial. There are, how- 
ever, many options for assessing predictive validity that 
deserve exploration, including which data to hold out, 
how many repeated samples to perform, and which 
metrics of validity to calculate. 

In bioinformatics and genomics, the standard predic- 
tive validity approach is to perform five- or 10-fold 
cross-validation, in which 10% or 20% of the data are 
held out and models are developed on the remaining 
data, a process which is then repeated multiple times to 
ensure stability. The statistical literature suggests that 
leave-one-out methods overestimate performance, 
instead recommending that larger hold-outs be used 
[73-75]. For cause of death estimation, we can identify 
five distinct scenarios for prediction: a) countries with 
no data; b) countries with missing data years but with 
data from years before and after the missing sequence; 
c) data missing at the beginning of the sequence; d) data 
missing at the end of the sequence; and e) data for some 
age groups such as children or reproductive-age females 
and not for other age groups. Countries may often be 
characterized by a combination of these scenarios. Dif- 
ferent models can and do in fact perform differentially 
on these various tasks. We believe that the data hold- 
out strategy should mimic the task required of the 
model. In other words, we should hold out sequences of 
data in proportion to how often the model would need 
to fill in such a sequence (Figure 1 illustrates the con- 
struct). In this way, the hold-out will reflect the appro- 
priate mix of the five scenarios that have been 
described. Producing train-test datasets through a hold- 
out strategy should be repeated often enough to yield 
stable assessments of predictive validity that are not a 
function of the idiosyncratic nature of a particular ran- 
domly generated data hold-out. 

The metrics of predictive validity are more complex. 
Three distinct dimensions should be assessed: errors in 
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Scenario A: Complete Missingness Scenario D: Missingness at End of Sequence 
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Scenario B: Missingness in Middle of Sequence Scenario E: Missingness in Some Age Groups 
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Scenario C: Missingness at Beginning of Sequence 
Year 

1980 1985 1990 1995 2000 2005 2010 1980 
0 



Combination of Scenarios 
Year 

1985 1990 1995 2000 2005 2010 



20 



0) 

u> 40 
< 



60 



80 



20 



a 

u> 40 
< 



60 



80 



Figure 1 Examples of missingness patterns observed in data for a single country and cause of death Green represents available data 
and red represents missing. 



predicting levels, errors in predicting trends, and the error and metrics that are based on variances or abso- 

adequacy of uncertainty intervals. First, for assessing lute deviations. Second, two models may yield similar 

prediction error for the quantity of interest, one can metrics of prediction error on the levels of mortality but 

imagine metrics that assess absolute error or relative one model may more robustly predict the trend in a 
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given cause of death. For many applications of cause of 
death modeling, estimating an accurate trend may be 
even more important than estimating an accurate level. 
Predictive validity for trends can be assessed by examin- 
ing the first differences over time for the prediction and 
the data and assessing error in this dimension. Third, it 
is particularly important to derive uncertainty intervals 
for cause of death estimation that are accurate. Cover- 
age of the uncertainty intervals of any proposed model 
should be reported taking into account the sampling 
and nonsampling error estimated for a model. The 
reported uncertainty interval for a model is the uncer- 
tainty in the expected value of the death rate. This 
should be much narrower than the interval for the pre- 
diction of data measurements of that expected quantity 
of interest. An ideal model should have data coverage 
close to or greater than 95%. 

Some ensemble models use information on the perfor- 
mance of component models to generate the weights on 
the component models [65] . If weights are chosen from 
in-sample fit, these may not yield the best results. In 
cases where ensemble weights need to be selected on 
the basis of out-of-sample performance, a train-test 1- 
test 2 strategy needs to be used. In these cases, some 
percentage of the data, such as 70%, are set aside for the 
training set, and the remaining data are randomly split 
into two test datasets. The first test is used to develop 
ensemble weights, and the performance of all compo- 
nent models and the ensembles are compared using the 
second test dataset. In this way, the comparison in the 
second test between the component models and variants 
of ensemble models is equitable, using data that have 
not previously been used for any modeling strategy. 
Train-test 1-test 2 strategies will need to be repeated 
multiple times until stable performance metrics have 
been achieved. 

5. Choose the model or ensemble model with the best 
performance in the out-of-sample predictive validity tests 

In selecting a best single model or model ensemble, the 
three types of assessments may have to be balanced. 
The model with the lowest root mean squared error 
(RMSE) may do worse on trend or have coverage that is 
too high or too low. Balancing these attributes will 
depend on the specific cause of death application. An 
overall judgment to choose the final model or model 
ensemble should take into account preferences for these 
three domains. A set of preferences should in principle 
be translated into an objective function that captures 
the desired trade-offs between different dimensions of 
performance. Most users will, in fact, want to balance 
various attributes of predictions, and they will want to 
examine metrics of predictive validity that capture these 
various attributes. In all cases, model selection should 
be designed to be robust to outliers. 



Methods 

Cause of Death Ensemble model (CODEm) - an integrated 
cause of death modeling environment 

We have developed a cause of death modeling environ- 
ment to facilitate work on modeling cause- specific mor- 
tality for a large number of countries, which can be 
applied to any cause of death for which data are avail- 
able. To design this modeling tool we have developed a 
specific implementation of the five principles that we 
have outlined above. Many specific choices were 
required to develop a computationally tractable but flex- 
ible strategy that is consistent with these principles. In 
this section, we describe in detail these design choices, 
including the development of a large set of plausible 
models, the development of ensemble models using 
adaptive weighting systems, the assessment of out-of- 
sample predictive validity, and final results using mater- 
nal mortality as a case study. 

We illustrate the application of CODEm to modeling 
several major causes of death using the cause of death 
database that has been developed at the Institute for 
Health Metrics and Evaluation. This database has been 
developed following the first two principles outlined 
above. For reference, Table 1 summarizes the available 
cause of death data from vital registration systems, ver- 
bal autopsy studies, surveillance systems, and various 
surveys/censuses with some cause-specific data. In addi- 
tion, it includes data based on information collected at 
hospitals, mortuaries, burial sites, etc. Data inputs have 
been processed to deal with various issues to enhance 
comparability. For example, Naghavi et al. have devel- 
oped algorithms to systematically deal with problems of 
ICD revision comparability and the phenomenon of 
"garbage coding" [49]. In other cases, datasets have been 
made comparable by mapping from aggregated age 
groups to five-year age groups. The end result of this 
work is a database of multiple sources of cause of death 
data that is continuously updated as new datasets are 
identified. While we use this database to illustrate the 
application of CODEm, in principle CODEm can be 
applied to any cause of death dataset. 



Table 1 CODEm data sources by type and decade 



Type 


1980-1989 


1990-1999 


2000-2010 


Total 


Cancer registry 


389 


505 


392 


1286 


Other 


0 


14 


52 


66 


Sibling history 


410 


807 


325 


1542 


Surveillance 


3 


31 


55 


89 


Survey/census 


1 


54 


49 


104 


Verbal autopsy 


138 


175 


143 


456 


Vital registration 


799 


969 


909 


2677 
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Developing plausible models 
Model families 

We have developed four families of models that reflect 
the choice of the dependent variable and the use of spa- 
tial-temporal patterns in the data. Table 2 shows the 
four families of models. One choice is whether to model 
cause fractions or death rates; the second choice is 
whether to use a simple linear hierarchical model or to 
also capture spatial-temporal patterns in the unex- 
plained component of the data. As highlighted in Table 
2, taking both choices into account yields four modeling 
families. We model cause fractions using the logit of the 
cause fraction. Mortality has long been shown to have 
an exponential relationship to units of time [28,76], so 
we model death rates using the natural logarithm (In) of 
rates. For each family of models, we include multiple 
possible covariate combinations, resulting in possibly 
hundreds of different models. 
Choosing covariates 

For each cause of death, one could postulate many feasi- 
ble models based on the published literature where 
cohort studies, cross-sectional time series studies, or 
intervention trials have suggested that there is a rela- 
tionship between a covariate and a specific cause of 
death. Because of multicollinearity, including all possible 
covariates in a model often yields implausible signs on 
covariates or unstable coefficients, as well as overfitting. 
There is an extensive literature on choosing covariates 
in these circumstances, including forward and backward 
stepwise regression [77,78], LASSO (Ll-constrained 
regression) [79], elastic nets [80], Bayesian variable 
selection [81], and other covariate selection methods. 
Especially when predicting out of sample for countries 
with no data or very limited data, the specific choice of 
covariates can make a large difference on prediction 
[82,83]. Building on the various traditions for choosing 
covariates, we have developed an algorithm that cap- 
tures both strong views on the plausible relationships 
between covariates and the relevant cause of death as 
well as our desire to propose a diversity of plausible 
models. Ideally the uncertainty of the covariates would 
also be taken into account, but it is not computationally 
feasible at present to bootstrap each covariate. 

Our solution to this problem is a covariate selection 
algorithm that takes into account prior information on 
which covariates are important and what sort of effect 



Table 2 Families of CODEm component models 





Linear mixed effects 


Spatial-temporal 




model 


model 


Ln(cause-specific 


1 


3 


death rate) 






Logit(cause fraction) 


2 


4 



they should have on the dependent variable. We first 
select n covariates and categorize them into three 
groups based on how strong the evidence is of the cau- 
sal connection. For each disease, the literature was 
searched for previous studies showing correlations 
between available covariates and cause-specific mortality 
(particularly at the population level), and disease experts 
were consulted. Covariates with strong proximal rela- 
tionships, such as etiological or biological roles in the 
disease process, are ranked as level 1. Covariates for 
which there is strong evidence for a relationship, but 
not a direct biological link, are placed in level 2. Covari- 
ates with weak evidence for a relationship, or which 
would be distal in the causal chain and thus may be 
mediated by factors in levels 1 or 2, are categorized as 
level 3. Based on the literature, we assign a prior on the 
direction of each covariate. Covariates that should 
increase the dependent variable are classified as having a 
positive prior, while those that should be inversely 
related are given a negative prior. If there is conflicting 
or inconclusive evidence as to the expected direction, 
the user can also specify that either direction would be 
valid. By definition, we would not expect to assign an 
ambiguous direction for a level 1 covariate. 

After priors have been set according to level of evi- 
dence and presumed direction, a list of all possible cov- 
ariate combinations for level 1 is created. We test all 2 n - 
1 combinations of level 1 covariates. We retain all mod- 
els where the sign on all covariates in that model is in 
the expected direction and the coefficient is significant 
at the p < .05 level. In each case, the model is estimated 
with the mixed effects structure used in the spatial-tem- 
poral model described below (random effects on super- 
region, region, and age), with the covariates under 
exploration included as fixed effects. So as to not have 
to retest each of these covariate combinations with the 
linear model family described below, we make the sim- 
plifying assumption that the inclusion of an additional 
country random effect will not change the direction or 
significance of the fixed effects; this assumption has held 
true in our testing and greatly reduces computational 
burden. 

Level 2 covariates are assessed using a more computa- 
tionally efficient approach. First, for each level 1 model 
that was retained, we create a list of 2 m possible level 2 
models (where m is the number of level 2 covariates). 
The first, which has no level 2 covariates included, has 
already been tested and is retained. Next, each of the m 
possible models in which one covariate is added to the 
level 1 model is tested. If adding the level 2 covariate 
does not affect either the significance or the sign on any 
level 1 coefficients, and the level 2 covariate itself fits 
the priors on direction and significance, then it is 
retained as another possible model. If, however, the 
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level 2 covariate does not fulfill the priors or forces any 
of the level 1 covariates to violate their priors, then it is 
dropped. 

In addition, all other possible level 2 models that contain 
that covariate are also dropped. For example, if we have a 
level 1 model A, and three possible level 2 covariates, X, Y, 
and Z, we first test each of them individually against level 

1 model A. Suppose x and Y fulfill the priors, but Z vio- 
lates them. Then, we can eliminate the level 2 models Z, X 
+Z, Y+Z, and X+Y+Z. Thus, we can test just one more 
level 2 model, X+Y. If that model fulfills the prior, then we 
are left with 4 different models from our original level 1 
model: A (level 1 only), A+X, A+Y, and A+X+Y. This is a 
conservative approach because it reduces computational 
burden - both for the software which thus has fewer mod- 
els to evaluate, and for the analyst who typically lacks the 
sorts of information that would be necessary to formulate 
conditional priors for multiple covariates. However, if the 
analyst has sufficient time and computational power, all 
covariates can be designated as level 1 and thus all combi- 
nations of covariates will be tested. 

Next, we take each of the models resulting from level 

2 and use the same process as described for level 2 on 
the level 3 covariates. In this way, we can either test or 
preemptively eliminate all (2 n -l)*2 m *2 1 possible covariate 
models (where n = the number of level 1 covariates, m 
the number of level 2 covariates, and / the number of 
level 3 covariates) in an efficient manner. Ultimately we 
obtain a set of all possible covariate combinations that 
fulfill our priors on covariate direction. This ensures 
that we do not exclude any potentially valuable informa- 
tion in our modeling process. We run the covariate 
selection tool for both cause fractions and death rates 
and then create both mixed effects only and spatial-tem- 
poral models for each set of chosen covariates. 

This covariate selection method, which produces a list 
of models for which prior covariate relationship beliefs 
are maintained, allows for great flexibility in model 
choice. If a user specifies all-cause mortality as a level 1 
covariate, then a classical Preston-type cause of death 
model would be tested. Or if there is strong reason to 
believe that cohort effects are prominent for a cause of 
death, then cohort-lagged measures could be incorpo- 
rated. For instance, in the case of lung cancer an analyst 
could test five-year cohort lags of smoking prevalence 
(in which the value used to estimate lung cancer in 60 
year olds in 2005 would be the smoking prevalence 
amongst 55 year olds in 2000), 10-year cohort lags (in 
which lung cancer deaths in 60 year olds in 2005 would 
be based on smoking prevalence amongst 50 year olds 
in 1995), and so forth. By using covariate selection to 
select a broad pool of logical models, different strategies 
such as these can be employed and compete on the 
same metrics of performance. 



Linear mixed effects models 

Model families 1 and 2 use a mixed effects model with 
fixed effects on covariates (selected via the mechanism 
explained below) and age dummies, plus hierarchical ran- 
dom effects by super-region, region, country, and age. 
The fixed effects allow us to capture broad trends in both 
age patterns and the impacts of key biological and envir- 
onmental covariates. The random effects allow for 
improved estimation by adding intercept shifts by GBD 
super-region, region, and country (the 187 countries are 
grouped into 21 regions based on both geographical 
proximity and epidemiologic similarity; the 21 regions 
are further grouped into seven more general super- 
regions), and changes in age patterns across regions and 
countries. The models of this family follow this form: 

In (rate s , r , c ,y,a) = ^i SJ , Ciy , a + Pad + n s + 7t s , r 
+ 7Is,r,a + lts,r,a,c + £ s,r,c,y,a 

logit (cause fraction s ,r, c ,y, a) = Pi x i s , r ,c,y,a 
+ fia.d + 7t s + 7t s , r + Tt s ,r,a + ^s,r,a,c + £ s,r,c,y,a 
Where: 

s = super-region index; r = region index; c = country 
index; y = year index; a = age index 

[countries are nested within regions, which are nested 
within super-regions] 

Pi = coefficient on covariate i 

Xis, r, c, y , a = covariate i for observation s, r, c, y, a 
P a = coefficient on age offsets 
d = age dummy variables 
n s = random intercept on super-region 
n s> r = random intercept on region (nested within 
super-region) 

n s, r, a = random intercept on age (nested within 
region) 

n s, r, a, c = random intercept on country (nested within 
region-age) 

£s,r,c,y,a ~ N(0, a £ • I) 

Because of the small numbers that are often encoun- 
tered for certain age groups, countries, or causes of 
death, covariate models may occasionally predict num- 
bers that are negative in natural log or logit space. To 
avoid creating very large residuals that can negatively 
affect subsequent prediction steps, we have introduced a 
floor such that the predictions never go below a rate of 
.01 deaths per 100,000 people. In addition, log rate 
models, unlike logit cause-fraction models, are not con- 
strained from predicting more deaths than the all-cause 
mortality rate. We have greater confidence in all-cause 
mortality predictions, because there are typically more 
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data available for predicting all-cause mortality rates in 
the form of censuses, demographic and health surveys, 
vital registration systems that do not capture cause of 
death, etc. We have therefore placed a ceiling on log 
rate models such that they can never exceed the log all- 
cause mortality rate. 
Spatial-temporal models 

Model families 3 and 4 begin with nearly the same 
mixed effects model used for families 1 and 2, but with 
the removal of the random country effect. The spatial- 
temporal models then utilize additional regression analy- 
sis to take into account how the dependent variable 
further varies across time, space, and age. This type of 
spatial-temporal regression model has been used in 
many applications, including the estimation of maternal 
mortality [36]. We do this by calculating the residual 
(predicted - observed dependent variable) for each data 
point and then run local regression in three dimensions 
on the residual. The process assumes that residuals con- 
tain valuable information that cannot be directly 
observed but nonetheless vary systematically across geo- 
graphic region, time, and age group. This allows us to 
predict how much the observed dependent variable dif- 
fers from the mixed effects model's prediction and to 
account for these differences. 

In order to perform the local regressions we must 
cycle through each observation in the dataset, weight 
every other observation in the dataset relative to it, and 
then find the weighted mean of the residual term. The 
first dimension across which we calculate weights is age. 
We use a simple exponential decay function 



1 



7 ru x\agegroupi—agegroupi 



such that if observation / has the same age group as the 
observation we're predicting for (i), then it receives an age 
weight of 1. When co is set to its default value of 1, if /' is in 
the adjacent age group, it receives a weight of .367, if it is 
two age groups away the weight is .135, etc. We chose this 
weighting scheme recognizing that mortality estimates 
typically change smoothly over age. For causes with sparse 
data, co can be decreased to induce greater smoothing over 
age; in causes of death that we expect to have very differ- 
ent levels by age, co can be increased to reduce smoothing 
over age groups. 

Next, we weight all observations /' relative to observation 
i in time by using a weighting scheme similar to the tricu- 
bic weights used in traditional LOESS local regression: 



w kj = 1 



year; — yearj 



argmax(| yearj — yearj |) + 1 



The key difference between our time weights and tra- 
ditional LOESS weights is that we leave A, as a para- 
meter that can be tuned to increase or decrease how 
much smoothing occurs across time. We use A, = .5 for 
countries that have data. If we were predicting the year 
1995 in a time series from 1980 to 2010, then this 
would correspond to a time weight of 1 for observations 
from 1995, a weight of .42 for observations from 1994 
or 1996, a weight of .27 for 1993 or 1997, a weight of 
.09 for 1990 or 2000, and a weight of close to 0 for 
1980 or 2010. For countries without data, we have 
much less certainty in their trends; thus, we use a higher 
A, of 2, which smooths out predictions and avoids issues 
of compositional bias. 

The age and time weights are multiplied together for 
each observation, producing a weight that reflects proxi- 
mity in both dimensions. Then, the weights are rescaled 
to reflect geographical proximity to the observation 
being predicted. We use the following formula to rescale 
the weights: 



Wu = f X 



W ai4 X W ti 



E («%• X U\ ; ) 



For data from country i 



= £ X (1 - £) X 



t% x w ti 



E (i% x u/ £i j ) 



For data from region i, but not country i 

w i4 = (1 - O 2 x 



E x U/ty) 



For data from super-region i, but not region i 
Wj j = 0 for data from outside of super-region i 
We use a value of C, = .9 for countries with data, such 
that 90% of the weight in the local regression is given to 
observations from the same country, 9% is given to data 
from the same region but outside the country, and just 
1% is given to data in other parts of the super-region. 
When there are no data for a country, we use C, = .7, 
which gives 70% of the weight to data from within the 
region and 30% to data from other parts of the super- 
region; this reflects the fact that we wish to borrow 
more strength in such data-sparse areas. 

In addition to the geographic weighting described 
above, we also rescale the weights inside a country 
when both national and subnational data are present. In 
cases where we have both national and subnational 
observations for the country we're predicting, we rescale 
the weights such that 90% of the in- country weight goes 
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to the nationally representative data points and 10% is 
assigned to the subnational data. This reflects our desire 
to capture nationally representative trends. 

Once these weights have been calculated, weighting 
every observation in the dataset relative to the one 
being predicted, it a simple matter of calculating a 
weighted average of the residuals from the mixed effects 
regression. This "predicted residual" is then added back 
onto the mixed effects prediction, creating an estimate 
that more closely takes into account aspects of the data 
that cannot be captured by a simple covariate model. 
Gaussian Process Regression 

Gaussian Process Regression (GPR) is a Bayesian estima- 
tion technique that is well-suited to estimating time ser- 
ies data because it maintains correlation in the 
uncertainty over time. The resulting estimates also track 
in-sample data very closely without significantly chan- 
ging the predictions out of sample. We utilize GPR as 
the final step in our spatial-temporal component mod- 
els. The inputs required are the mean function, ampli- 
tude, scale, degree of differentiability, and data variance. 

Each of the models developed for the different combi- 
nations of covariates in the two spatial-temporal families 
provides the mean function (a prior estimate of the 
dependent variable). We run a separate instance of GPR 
for each country-age group, such that GPR will simply 
pick up temporal trends. Amplitude is meant to capture 
the uncertainty in the mean function; a higher ampli- 
tude means that the GPR will be more likely to move 
far away from the mean function, conditional on the 
data. We estimate the amplitude for each model using a 
robust estimate of the standard deviation of the resi- 
duals from each first stage model - specifically, we com- 
pute the standard deviation as 1.4826 times the median 
absolute deviation (MAD) estimate of the residuals. 

The GPR parameters governing the covariance func- 
tion are scale and differentiability. We assume a scale of 
10 years and a differentiability of 2. In principle, these 
can be varied and the performance of different models 
tested out of sample to choose the best parameters. We 
have not, however, varied these as the performance of 
GPR in these applications does not appear to be very 
sensitive to the choice of these parameters. 

For data variance, we would ideally like to capture 
variance due to sample size and sample design, as well 
as the data variance due to the myriad sources of non- 
sampling variance. Sampling variance can usually be 
estimated from each source on the basis of sample size 
and sample design. The more challenging task is to esti- 
mate the data variance due to nonsampling error. Work 
on child mortality and maternal mortality suggests that 
nonsampling variance is often substantially larger than 
sampling variance. In settings where multiple measure- 
ments in the same place and time period are available, it 



is possible to directly compute nonsampling variance. 
However, in most cases we have insufficient data in the 
same country-year to provide a direct measurement of 
nonsampling error. 

To approximate the nonsampling error, we first com- 
pute a simple spatial-temporal weighted average of the 
natural log of the death rates. We then estimate for 
three cases the MAD estimator of the residual from this 
weighted average: for countries with more than 10-year 
sequences of vital registration data (representing systems 
with the highest data recording), for subnational data 
when national data are also present for the country, and 
all other cases. The MAD times 1.4826 is a robust esti- 
mate of the standard deviation [84]. The MAD estima- 
tor of the residuals includes both sampling and 
nonsampling error as well as the systematic variation in 
death rates not adequately captured by the spatial-tem- 
poral weighted average. As such it is an overestimate of 
nonsampling variance (NSV). We believe it is preferable 
to overestimate rather than underestimate data variance 
and have used the NSV for each data type from this 
procedure for all models. 
Uncertainty estimation 

In order to find uncertainty on model families 1 and 2, 
we first draw multiple times from the variance-covar- 
iance matrix of the fixed effects. We then simulate pre- 
dicted death rates using these draws of beta, giving us 
an estimate of the parameter uncertainty. In order to 
capture the systematic variance in the model, we also 
add to each draw a random value from Normal(0, a syste - 
matic) where o 2 systematic is estimated as the square of 
1.4826 times the MAD of the model residuals minus the 
estimated nonsampling variance. 

In order to be able to capture uncertainty and the cor- 
relation structure over time in uncertainty and to be 
able to distinguish between uncertainty in the quantity 
of interest and sampling and nonsampling variance, we 
have used GPR [85] for both families of spatial-temporal 
models (3 and 4). We use the output of all covariate 
possibilities for the two spatial-temporal local regres- 
sions as the mean function ("prior") in GPR. Our imple- 
mentation of GPR follows the approach of Rajaratnam 
et al. used for child and adult mortality estimation 
[42,86]. GPR input derivations are explained in the pre- 
vious section. 
Creating ensemble models 

In addition to the component models in the four 
families described above, following the literature on pre- 
diction we have also created and tested ensemble mod- 
els that are themselves made up from weighted draws 
from the posterior distribution of component models. 
Both theoretical and empirical findings suggest that 
ensemble models often have slightly better predictions 
but substantially more accurate assessments of 
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prediction uncertainty [87]. One potential key criticism 
of cause of death predictions is that uncertainty due to 
model specification is not captured. Ensemble models 
provide a clear strategy to capture specification uncer- 
tainty in the prediction intervals. 

There are multiple approaches to developing compo- 
nent models. Ensemble Bayesian Model Averaging 
(Ensemble BMA) models as proposed by Raftery, Gneit- 
ing, et al. [65] assess the probability of each model condi- 
tional on the training data. The ensemble model is then 
created based on weights that are equal to the probability 
of each model divided by the sum of the probabilities of 
all models. Ensemble BMA models have been found to 
outperform the best component model in weather fore- 
casting, hydrology, and other fields [59,65,87,88]. Ensem- 
ble BMA weights are generally based on the performance 
of the model in the training data (see below for the crea- 
tion of training and test datasets). 

In the case of our cause of death applications, we have 
found that in-sample performance in the training data- 
set can be a poor predictor for performance out of sam- 
ple in the test data. Specifically, some models with good 
or even the best performance in sample may yield poor 
performance out of sample. In these cases, it is possible 
to develop ensemble weights based on the probability of 
the model conditional on the test data, not the training 
data. To demonstrate objectively the performance of an 
ensemble weighted by the out-of-sample performance of 
the component models requires a strategy of splitting 
the data into three groups: training, a test dataset 1 used 
for estimating the probability of the model conditional 
on the test data, and a test dataset 2 to assess perfor- 
mance of the component models and ensembles based 
on out-of-sample performance. 

Based on the Netflix challenge [54] and other predic- 
tion efforts, other ensemble strategies may also perform 
well. Simple averaging of all plausible models is one pos- 
sibility. Simple averages of all models or of the top per- 
forming models can avoid adding to the variance in the 
model due to misestimation of the weights on each com- 
ponent model. A compromise approach is to average the 
top X models, which effectively puts 0 or 1/X weights on 
each component model. Another option is to use a 
monotonically declining function for weights based on 
the ranked performance of each component. This 
approach provides some reward for better performance 
but does not try to directly estimate a weight from the 
data. One example of a monotonically declining weight 
on each component model is to use the function: 

^(N—ranki) 

W i = / T 

E N (^(N-i)) 



Where N is the number of models and v|/ is a para- 
meter influencing the relative weighting of models. In 
this case, if there were four total models to choose from 
and v|/ = 1.2, the top performing model would be given 
32.2% weight, the second model 26.8%, the third 22.4%, 
and the last 18.6%. We use the weights to determine 
how many draws from each model to add to the final 
pool. We take 1,000 draws overall, so the top perform- 
ing model would have 533 draws included in the aver- 
age in this example, whereas the worst performer would 
have only 67 draws contributing to the final average. 
Figure 2 demonstrates how the weighting function 
works for different values of v|/ in the case of 100 com- 
ponent models. 

In an ensemble of 100 possible component models, a 
\\i of 1.2 would result in 167 draws from the top model, 
80 for the fifth place model, 32 from the 10 th place 
model, two from the 25 th place model, etc., and no 
draws at all from models ranked below the top 32 
(effectively placing a threshold such that the bottom 
68% of models in this case have no bearing on the final 
predictions). In contrast, an ensemble with v|/ = 1 would 
take equal draws from all models. 

We develop and test ensembles that include v|/ values 
from 1.0 to 1.2 in intervals of .01. We can then compare 
these different ensemble models using predictive validity 
performance on the second set of test data. This enables 
us to choose an ensemble model that weights the com- 
ponent models adaptively. An ensemble with some good 
models and some very bad models would perform best 
with a higher \\i value; perhaps surprisingly, previous 
studies have shown that an ensemble with all models 
having similar performance often performs best when all 
models are included (i.e., in the case of CODEm, when a 
lower value of v|/[55] is chosen). 

Because the component models and ensembles are 
developed without any access to the second test dataset, 
the evaluation of performance of each strategy is based 
on a fair out-of-sample test. 

Assessing predictive validity 

Our approach to predictive validity can be divided into 
two distinct parts: the strategy for developing train and 
test datasets, and the metrics used for assessing predic- 
tive validity. 

Developing train-test 1-test 2 datasets 

As noted above, when developing single models and 
ensemble models, we need to create datasets where the 
original data are split into three components: the train 
dataset, the test 1 dataset, and the test 2 dataset. While 
much of the literature in bioinformatics uses 80-20 
splits of the data when there is only one test set, we 
have compromised and used 70% of the data for training 
the models, 15% for test 1, and 15% for test 2. We have 
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Figure 2 Example weighting schemes for 100 component models at different values of v|/ 



found that results from this type of "knockout" strategy 
provide stable model orderings when the process is 
repeated 10 to 30 times. Figure 3 demonstrates that 
component model rankings on the test 1 dataset are 
susceptible to noise when there are just a few hold-outs 
but begin to stabilize after approximately 10 hold-outs. 
Furthermore, it demonstrates that in-sample fit is a 
poor predictor of out-of-sample fit. 

Our knockout pattern operates as follows: we create 
two randomly shuffled lists of all 187 countries, x and 
Y. We take the first country in list x and mark for 
which age-year pairs it is missing data for the selected 
cause and sex. We then knock out any data that are pre- 
sent in those age-years for the first country from list Y. 
For instance, if the country from list x had a missing- 
ness profile like that of Scenario D in Figure 1 (in which 
data after 2000 are not present) and the list Y country 
had all age-year pairs present, then we would knockout 
the data from 2001 to 2010 for the list Y country, leav- 
ing in the data from 1980 to 2000. We repeat this pro- 
cess for pairs of countries until 15% of the data in list Y 



are knocked out, assigning those data points to the test 
1 set. We then continue down the list of country pairs 
until an additional 15% of the data have been knocked 
out, which we assign to test 2. Note that this test is 
excessively stringent, because it does not take into 
account whether or not the country from list Y actually 
contains data in the same age-years as the list x country, 
making the task harder in most cases. 

We create 25 sets of train-test-test datasets based on 
the knockout strategy just described, then run all the 
component models on the training data. We use the 
results of each component model to predict for the test 
1 data. We then assess the performance of each compo- 
nent model in the test 1 dataset using the metrics 
described below. Ensemble model weights are estimated 
based on the performance of component models in test 
1 data. Then the ensemble models are evaluated for 21 
values of \|/ (1.0 to 1.2 at intervals of .01) on the test 1 
data. The ensemble weighting scheme that achieves the 
best predictive validity metrics on test 1 is then used to 
create the final predictions. Its predictive validity is 
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Component Model Rankings Across Holdouts 
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Figure 3 Heatmap of stability in maternal mortality component model rankings Each row represents a component model, and each 
column corresponds to how many hold-outs are used when ranking. Rows are sorted by rankings using all 25 hold-outs. Cells are colored based 
on their ranking, with the green models being the best and red the worst. The last column displays in-sample ranks. The figure shows that 
rankings stabilize as hold-outs increase, and that in-sample ranking does not correlate well with out-of-sample. 
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assessed along with the component models on the test 2 
data. 

Metrics of predictive validity 

Predictive validity is evaluated using three metrics. First, 
we evaluate how well each model for a particular cause 
of death predicts age-specific death rates using the 
RMSE of the natural log of the death rate. Log death 
rates are generally comparable across age groups, so the 
choice of RMSE of the natural log of death rates means 
that we are equally concerned about a 10% error in one 
age group compared to another. Metrics that focus on 
the absolute error would tend to give highest weight to 
model fit in the age groups with higher rates, but most 
users of cause of death data actually attach greater 
importance to accuracy at younger ages where each 
death represents more years of life lost. We have thus 
chosen a method of assessing accuracy that we think 
balances the need to develop models that predict both 
an accurate overall level of mortality as well as an accu- 
rate age pattern. Second, we also desire models that pre- 
dict accurate trends. To do this, for the test data we 
compute where possible the log death rate in year t 
minus the log death rate in year t-1. We also compute 
the same metric for the prediction. We then count the 
percentage of cases for which the model predicts a 
trend in the same direction as the test data. 

In order to compare the performance of multiple 
families of models that may use different measures and 
transformations of cause of death data (e.g., logit cause 
fractions versus log death rates in our case), we must 
choose a single space in which to assess all the models. 
We have chosen to assess performance in log rate space. 
We prefer to assess accuracy of rates instead of cause 
fractions because they are generally more useful from a 
public health perspective - for instance, a declining rate 
represents progress on combatting a cause of death, 
whereas a declining cause fraction could result from a 
stagnating death rate for one cause but an increase in 
another. Log rates are used because of the well-estab- 
lished log-linear relationship with death rate and time 
[28,76]. 

Ideally, we would like to use out-of-sample coverage 
as an additional metric in comparing component mod- 
els' performance. Unfortunately, it is computationally 
intractable to compute full confidence intervals for 
every single component model once the model pool 
becomes large. As computational speed improves, cover- 
age can be added into the model ranking scheme. In the 
case of CODEm, we find the computationally expedient 
expectation of each component model for the ranking 
process. 

We rank each component model on the median of 
these two metrics across hold-outs (lower being better 
for RMSE, higher being better for the trend test). We 



then sum up the ranks across the two metrics for each 
component model, and assign overall rank 1 to the 
model with the smallest sum of ranks, rank N to the 
model with the highest sum of ranks, etc. These ranks 
are then used in the ensemble weighting equation 
described above. 

We similarly rank the ensemble models for different 
values of \|/ on these two metrics and compute the sum 
of their ranks across hold-outs for test 2. We choose the 
value of v|/ that minimizes the sum of ranks and use it to 
make our final set of predictions. 

Finally, an important property of the models is that 
they generate plausible prediction intervals. We there- 
fore compute the percent of data in the test set that 
were included in the 95% prediction interval. The pre- 
diction interval is based both on the uncertainty in the 
predicted death rate and the data variance for each 
observation. We report the mean value of each of these 
metrics across the 25 test 2 datasets. 

Results 

We demonstrate the logic of CODEm and the perfor- 
mance of various models for maternal mortality. Data 
availability for maternal mortality are summarized in 
Table 3. Overall, there are 4,563 site-years available for 
analysis. In addition, there are a number of applicable 
covariates available for analysis. Table 4 provides the 
class 1, 2, and 3 covariates for maternal mortality, along 
with the priors placed on the direction of their 
coefficients. 

The process of running covariate selection for the four 
families of models yields a total of 338 models, as shown 
in Additional File 1 for maternal mortality. In order to 
find these models, the covariate selection tool began 
with 1,984 possible models each for rates and cause 
fractions. We ran 261 regressions on rates and 226 on 
cause fractions and found 98 rate models and 71 cause 
fraction models that fulfilled all of our priors, each of 
which was tested as both a simple mixed effects model 
and a spatial-temporal model. The difference between 
1,984 possible combinations of covariates and the 261 
for which regressions were run relates to the algorithm 

Table 3 CODEm data sources for maternal mortality, by 



source type and decade 



Type 


1980-1989 


1990-1999 


2000-2010 


Total 


Other 


0 


5 


41 


46 


Sibling history 


410 


807 


325 


1542 


Surveillance 


3 


23 


51 


77 


Survey/census 


1 


9 


41 


51 


Verbal autopsy 


49 


72 


87 


208 


Vital registration 


799 


944 


896 


2639 
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Table 4 Covariate priors for maternal mortality covariate 



selection 

Covariate Level Prior 

In-facility delivery coverage 1 

Skilled birth attendance coverage 1 

Malnutrition (proportion under 2SD) 1 

Lnftotal fertility rate) 1 + 

Age-specific fertility rate 1 + 

Health system access score 2 

Antenatal clinic (four visits) coverage 2 

Antiretroviral (ARV) adjusted HIV prevalence 2 + 

Ln(neonatal death rate) 2 + 

Education (year per capita) 3 

Ln(lag-distributed income per capita) 3 



for class 2 and class 3 covariates where models are not 
evaluated if the simpler model with a covariate does not 
have the sign and statistical significance expected. 

As seen in Figure 4, it is evident that performance var- 
ies widely across models; RMSE ranges from 0.616 to 
0.774 across component models in test 1, while the 
trend test ranges from 0.572 to 0.667. In-sample fit 
ranges from 0.368 to 0.550 in RMSE and 0.569 to 0.740 
on the trend test. The performance on test 2 is very 
similar to that of test 1, as also seen in Figure 5, sug- 
gesting that model performance is not a function of the 
particular hold-outs used for the assessment. Figure 4 
shows several important patterns. First, for RMSE, in- 
sample fit for all the spatial-temporal GPR models, 
whether for cause fractions or for rates, are much better 
than for the mixed effects models. This is to be 
expected given that these models tend to capture some 
of the patterns in the residuals that are structured over 
space and time. However, in the case of maternal mor- 
tality, the figure also shows that in the out-of-sample 
RMSE tests, these models tend to also do better, 
although there is clear overlap between classes of mod- 
els. Of note, within a class of models, particularly the 
spatial-temporal GPR models, there is little relationship 
between in-sample fit and out-of-sample RMSE. This 
visual impression is confirmed by the rank order corre- 
lation coefficient within model class between in-sample 
RMSE and out-of-sample RMSE on test 2, which are 
-0.369, 0.463, 0.887, and 0.830 for spatial-temporal 
cause fractions, spatial-temporal rates, linear cause frac- 
tions, and linear rates, respectively. The strongest rela- 
tionship observed between in-sample performance and 
out-of-sample predictive validity is for linear cause frac- 
tion models. However, even a rank order correlation 
coefficient of 0.887 within that class of models means 
that some poorly performing models might be selected 
if selection is based only on in-sample fit. For the better 



performing classes of models, in this case the spatial- 
temporal cause fraction and rates models, in-sample fit 
is not a useful guide for model choice at all. 

Figure 4 also shows a similar analysis of performance 
for the trend test, but the patterns are distinctly differ- 
ent. In the case of maternal mortality, in sample the 
spatial-temporal models do better than the linear mod- 
els. However, out of sample, the cause fraction models 
perform substantially better than the rates models. In 
fact, in both test 1 and test 2 cases, the linear cause 
fraction models have on average slightly better perfor- 
mance than the spatial-temporal models for cause frac- 
tions; there is however substantial overlap in these sets 
of models on this test. Within model class, the rank 
order correlation coefficient for the trend test is 0.675 
for spatial-temporal cause fractions, -0.174 for spatial- 
temporal rates, 0.975 for linear cause fractions, and 
0.984 for linear rates. 

The high correlation of out-of-sample performance for 
RMSE between test 1 and test 2 provides reassurance 
that the ordering of models is not specific to the parti- 
cular set of hold-outs used. Nevertheless, it is important 
to ascertain how many hold-outs are required to get a 
stable assessment of the ordering of different models 
within and between classes. Figure 6 illustrates how the 
stability in component model rankings improves as 
more hold-outs are added. After 25 hold-outs, the rank 
correlation exceeds .98; 40 hold-outs yields rank order 
correlation coefficients in excess of 0.99 for maternal 
mortality. Figure 3 shows that for maternal mortality the 
top ranked models have all become rather stable after 
25 hold-outs; it further demonstrates that in-sample fit 
is poorly correlated with out-of-sample predictive valid- 
ity. Extra hold-outs tend to further stabilize the ranking 
of the poorer performing models rather than change the 
order of the top models in this case. 

Notice in Additional File 1 that spatial-temporal mod- 
els on the logit of the cause fraction occupy the top 88 
slots out of 338 total component models. This indicates 
that spatial-temporal models do indeed do a better job 
of capturing levels and trends not present in covariates 
alone. The fact that cause fraction models perform bet- 
ter than rate models suggests that cause fractions, which 
are constrained to be less than the all-cause mortality 
rate, may benefit from the additional information pro- 
vided by being multiplied by the all-cause mortality rate 
when calculating final predictions. The covariates for 
the best ranked model included the natural log of the 
total fertility rate, skilled birth attendance coverage, and 
the natural log of lag-distributed income per capita. 
Similar models with additional covariates, such as the 
fourth-ranked model, which added age-specific fertility 
rates, performed worse in many cases, suggesting that 
while additional covariates may improve in-sample fit 
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Figure 4 In-sample, test 1, and test 2 performance on RMSE and trend test for maternal mortality component models 



they often cause overfitting that harms out-of-sample case of maternal mortality a \|/ of 1.17 has been chosen, 

predictions. This is a rather steep weighting scheme, giving over half 

Table 5 shows how well ensemble models of varying the weight to just the top five component models - the 

levels of v|/ perform on test 1 predictive validity. In the top component model receives 139 draws, relative to 
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Figure 5 Correlation between test 1 and test 2 rankings of maternal mortality component models 



102 draws for the third-place model, 46 for eighth place, 
15 for 15 th place, etc (see the last column of Additional 
File 1). Interestingly, \\i of 1.0 - which gives equal 
weights to all models - performed the worst on both 
predictive validity tests, suggesting that there was at 
least one bad component model that would harm pre- 
dictions if included. 

Table 6 provides the performance on test 2 of the best 
component model (the spatial-temporal model on logit 
(cause fraction) with covariates of age-specific fertility 
rate, the natural log of total fertility rate, and the natural 
log of lag-distributed income per capita) and of the best 
ensemble model. Results are consistent with previous 
findings that out-of-sample predictive validity metrics 
are improved by creating an ensemble of models instead 
of just using the top component model. RMSE 
decreases, trend test performance remains about the 
same, and the prediction interval coverage also 
increases. 

Finally, we checked the test-retest stability of our 
method by running the same model with 25 hold-outs 



twice on maternal mortality. The Pearson correlation 
coefficient for the log rates of the final mean estimate 
(by country, year, and age) was 0.9998 for the two runs, 
indicating that the method gives consistent results with 
25 random hold-outs. Nevertheless, given that the hold- 
out process and assessment of predictive validity is sto- 
chastic by nature, the exact set of top performing mod- 
els for a cause of death can vary between model runs 
even if the predictions are highly consistent. 

Table 7 shows summary results comparing the ensem- 
ble model to the best component model for several 
causes of death, including cardiovascular disease, 
chronic respiratory disease, cervical cancer, breast can- 
cer, and lung cancer. The table highlights that in each 
case, to a varying degree, the ensemble model has as 
good as or slightly better predictive validity metrics than 
the component model. In most cases the out-of-sample 
performance for RMSE and trend test both improve or 
remain roughly equal when comparing the ensemble to 
the top component model. The one exception is lung 
cancer in males, in which the ensemble model has 
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slightly worse RMSE and trend; however, the coverage 
of the ensemble models is much better than that of the 
component, so this is most likely a desirable trade-off. 
In no case do we observe that the ensemble model per- 
forms significantly worse than the top component 
model, and in many instances the ensemble is a sub- 
stantial improvement. In all cases, the uncertainty inter- 
val coverage is better for the ensemble model than the 
best component model. 

Discussion 

In this paper, we have proposed a framework for cause 
of death model development. We have illustrated this 
approach for several causes of death including maternal 
mortality, cardiovascular disease, chronic respiratory dis- 
eases, cervical cancer, breast cancer, and lung cancer 
mortality. Our strategy yields an ensemble model with 
smaller error in estimated rates than the single best 
model. It yields more accurate trends and often better 
uncertainty intervals with nearly 95% data coverage in 
out-of-sample prediction tests. We have used maternal 
mortality in this paper to illustrate the results of 



CODEm, but we have applied this strategy to many 
causes with similar findings; in addition, separate papers 
utilizing this method include more detailed results for 
maternal mortality, breast cancer, and cervical cancer 
[89,90]. The model development strategy we have used 
has already been widely applied in other fields; what we 
have done is to develop a pragmatic implementation of 
these ideas to cause of death modeling. 

Debate on cause of death estimation, such as for 
maternal mortality [36,38-41], can be traced to three 
components: the database used, including covariates; the 
processing of data to enhance comparability and quality; 
and model building. By encompassing a very wide range 
of models and model families combined with objective 
assessment of performance through out-of-sample pre- 
dictive validity tests, we expect that our approach will 
substantially decrease debate around model building. 
Debate in the future will more likely focus on the data 
processing step. With regards to the database, most ana- 
lysts will agree that a more systematic collation of all 
data sources is preferable to fewer data sources. The 
real issue is how one uses or modifies observations in 
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Table 5 Ranking and predictive validity metrics for maternal mortality for different values of \|/ 



Rank 




RMSE (in-sample) 


RM5E (test 1 ) 


Trend test (in-sample) 


Trend test (test 1) 


1 


1 .1 7 


0.368 


0.61 7 


0.737 


0.737 


2 


1 .1 6 


0.368 


0.61 7 


0.737 


0.737 


3 


1 .1 5 


0.368 


0.61 7 


0.737 


0.737 


3 


1.11 


0.369 


0.61 7 


0.737 


0.737 


5 


1 .08 


0.369 


0.616 


0.737 


0.737 


6 


1 .1 2 


0.368 


0.61 7 


0.737 


0.737 


7 


1 .1 8 


0.368 


0.61 7 


0.737 


0.737 


8 


1 .06 


0.370 


0.616 


0.736 


0.736 


9 


1 .03 


0.376 


0.61 5 


0.734 


0.734 


10 


1 .04 


0.373 


0.61 5 


0.735 


0.735 


1 1 


1 .09 


0.369 


0.616 


0.737 


0.737 


12 


1 .05 


0.371 


0.616 


0.736 


0.736 


1 3 


1 .1 


0.369 


0.61 7 


0.737 


0.737 


1 4 


1 1-1 


0 368 


0617 


0 737 


0 737 


14 


1.13 


0.368 


0.617 


0.737 


0.737 


16 


1.19 


0.368 


0.617 


0.737 


0.737 


17 


1.02 


0.381 


0.614 


0.732 


0.732 


18 


1.2 


0.368 


0.617 


0.737 


0.737 


19 


1.07 


0.369 


0.616 


0.737 


0.737 


20 


1.01 


0.390 


0.614 


0.726 


0.726 


21 


1 


0.412 


0.624 


0.709 


0.709 



the largest database. Controversy will remain on how to 
deal with misclassification in cause of death assignment, 
subnational and nonrepresentative samples, and the 
exclusion of data points due to extreme nonsampling 
variance. We hope that by proposing and implementing 
a structured approach to model building, the idiosyn- 
cratic aspect of final model choice will be minimized or 
even eliminated. Debate on data processing will con- 
tinue and hopefully be fertile ground for methods inno- 
vation in the future 

In various conferences following the release of the 
maternal mortality studies in 2010 [36,38,39], some 
commentators argued that simple models should be pre- 
ferred to more complex models such as GPR. Underly- 
ing these comments are two possible perspectives. First, 
simple or parsimonious models may actually generate 
better predictions out of sample. Given that we test all 
models, whether simple, more complicated, or ensemble 
on a level playing field, if simple models are better they 
will be selected as the basis for making predictions in 



our modeling framework. The second possible interpre- 
tation of this view is more philosophical, namely that 
simple models are intrinsically preferable regardless of 
predictive performance. In other fields, like weather 
forecasting, consumers do not demand simple models; 
they demand accurate models. Not surprisingly, weather 
forecasters - who have in some ways an easier task than 
cause of death estimation, as their forecasts are shorter 
and have much more data to rely on - use dynamic 
ensemble models where the weights on various models 
are selected each day for tomorrow's forecast [87] in 
order to generate predictions with the smallest error 
and accurate uncertainty intervals. This is entirely 
inconsistent with a call from some in global health to 
sacrifice performance of cause of death models for the 
sake of simplicity. In this era, we all use complex devices 
such as computers, cell phones, or even watches. Most 
of us do not understand how the circuitry or machinery 
within the device works. We judge these devices by how 
well they perform, not how well we understand their 



Table 6 Predictive validity measures for final CODEm maternal mortality model 



Model 


RMSE (in-sample) 


RMSE (test 2) 


Trend test (in-sample) 


Trend test (test 2) 


Coverage (in-sample) 


Coverage (test 2) 


Top ensemble 


0.368 


0.626 


0.739 


0.665 


99.2% 


97.3% 


Top component 


0.369 


0.636 


0.737 


0.663 


99.2% 


96.9% 



Performance of the top component model is also shown for comparison. 
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Table 7 Predictive validity measures for CODEm results for several selected causes of death 


Cause 


Model 


RMSE 
(in-sample) 


RMSE 
(test 2) 


Trend test 
(in-sample) 


Trend test 
(test 2) 


Coverage 
(in-sample) 


Coverage 
(test 2) 


Cardiovascular disease 
(male) 


Ensemble 


0.123 


0.290 


0.764 


0.699 


0.993 


0.853 




Best component 
model 


0.122 


0.292 


0.764 


0.693 


0.992 


0.822 


Cardiovascular disease 
(female) 


Ensemble 


0.162 


0.304 


0.750 


0.698 


0.993 


0.903 




Best component 
model 


0.162 


0.305 


0.751 


0.707 


0.993 


0.877 


Maternal mortality 


Ensemble 


0.368 


0.626 


0.739 


0.665 


0.992 


0.973 




Best component 
model 


0.369 


0.636 


0.737 


0.663 


0.992 


0.969 


Cervical cancer 


Ensemble 


0.528 


0.944 


0.667 


0.593 


0.988 


0.938 




Best component 
model 


0.489 


0.951 


0.678 


0.587 


0.982 


0.91 2 


Rff»a^t ranrpr (fomalol 

Dl COAL \_alHCI \ICIIIGIC/ 


Fnc;pm ho 
LI lit: I I I Ul tr 


0438 


1 105 


0 678 


0 572 


0 985 


0 923 




Best component 
model 


0.428 


1.129 


0.675 


0.570 


0.981 


0.875 


Lung cancer (male) 


Ensemble 


0.359 


0.769 


0.714 


0.599 


0.993 


0.841 




Best component 
model 


0.330 


0.759 


0.738 


0.617 


0.993 


0.800 


Lung cancer (female) 


Ensemble 


0.383 


0.734 


0.705 


0.580 


0.995 


0.932 




Best component 
model 


0.371 


0.756 


0.738 


0.597 


0.994 


0.897 



inner workings. We believe that the same ethos ought to 
be applied for cause of death modeling. The predictive 
validity metrics we have described in this paper are use- 
ful not only for ranking researchers' own models and 
creating ensembles, but we believe they can also provide 
a framework within which modeling strategies from 
diverse research groups can be fairly compared. Since it 
is good out-of-sample predictive performance that 
should ultimately decide which models are most useful, 
then we think models should be evaluated and com- 
pared on those grounds. 

The current implementation of our strategy has sev- 
eral limitations. The Netflix challenge experience sug- 
gests that pooling results from very different modeling 
strategies performs better [54]. We have used four mod- 
eling families that are the product of modeling natural 
log rates and logit cause fractions, mixed effect linear 
models, and spatial-temporal GPR models. This creates 
a diverse pool of models that compete in predictive 
validity tests. For example, combined with the covariate 
selection approach, we had in the case of maternal mor- 
tality 7,936 possible component models. Nevertheless, 
many other modeling strategies are possible; given the 
experience in other areas, expanding the universe of 
modeling families that are included in the model pool 
will likely improve prediction performance. The funda- 
mental challenge to expanding beyond our four families 



of models is computational power and time. In our 
implementation, we have already reached the limit of 
what is currently practical - each model run takes 
approximately 600 hours of processor time, which for 
our cluster is equal to about 5,000 GFlops. As computa- 
tional power and cloud computing get progressively 
cheaper and estimation algorithms more and more opti- 
mized, it may be possible in future iterations to increase 
the diversity of the model pool and therefore the predic- 
tion performance of the model ensemble. As with any 
modeling strategy, the approach outlined here depends 
critically on the database that is used for estimation, the 
quality of the covariates, and the assessment of the evi- 
dence on the expected relationships for covariates. 

Although model building is becoming a computation- 
ally intensive but replicable and less subjective task, the 
challenge remains of capturing uncertainty due to cor- 
recting data for bias, excluding data points as outliers, 
and measuring and forecasting covariates. Garbage code 
redistribution algorithms are largely based on expert opi- 
nion, although some empirical approaches have been 
applied [51]. Methods to capture uncertainty in these 
approaches need to be developed. For verbal autopsy, the 
results of the Population Health Metrics Research Con- 
sortium verbal autopsy studies [91,92] provide a potential 
strategy for assessing misclassification uncertainty. For 
outliers or observations with high nonsampling variance, 
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one can imagine a number of strategies. Instead of 
assigning 0 or 1 weights to a data point, 0 if the point is 
an outlier and 1 if it is not, numbers from 0 to 1 could be 
assigned that represent the probability that a data point 
is an outlier. Repeated sampling of the dataset could be 
undertaken using this matrix, allowing for propagating 
uncertainty. Alternatively, outliers could be assigned 
higher nonsampling variance in techniques such as GPR 
that can take this into account. The covariates we use as 
inputs to our models also have measurement error, and 
because they are often themselves modeled they may also 
have prediction uncertainty. It is unclear if incorporating 
this uncertainty would be helpful in predicting causes of 
death since we test out-of-sample data coverage to check 
that it is close to 95%, but because of computational lim- 
itations we have been unable to test the usefulness of 
including such information. These approaches all repre- 
sent attractive avenues for further exploration. The main 
limitation is the computational time and cost that such 
schemes would imply. 

In many cases, cause of death information is most 
valuable when placed into context with the entire cause 
of death composition of the population. The relative 
burdens of causes of death are often more influential in 
priority setting than are the individual causes' respective 
sizes. For this reason, it is often desirable to produce 
estimates of mutually exclusive and collectively exhaus- 
tive cause of death categories. Ideally, one would design 
a model that predicts for all causes of death simulta- 
neously, accounting for correlations and dependencies 
between them. There are, however, three major limita- 
tions on modeling all causes simultaneously: severe 
restrictions on model diversity, limitations on data 
usability, and infeasibility for analysis. 

First, current computational restrictions make it 
intractable to fit multinomial or seemingly unrelated 
regression models while still utilizing the wealth of spa- 
tial-temporal information present in the datasets. We 
have found that borrowing strength across space, time, 
and age groups in the data has substantially improved 
predictive accuracy. Current techniques do not allow for 
us to take advantage of such modeling advances within 
a multinomial framework, unfortunately; as computa- 
tional power continues to increase and Bayesian sam- 
pling methods for highly multidimensional problems 
improve, such a model will likely be feasible in the 
future. 

Second, for many causes of death such as maternal 
mortality, there are extensive datasets that provide mea- 
surements of only one cause of death. For example, 
modeling all causes of death of reproductive aged 
women simultaneously would limit the use of these rich 
sources of data. The same situation exists for many spe- 
cific causes of death where there are cause-specific 



datasets, such as population-based cancer registries that 
do not include cause of death information for a set of 
mutually exclusive and collectively exhaustive causes. 
Finally, we believe it is infeasible for all studies of causes 
of death to always examine all causes simultaneously. 
Researchers will legitimately want to investigate single 
causes of death at a time in addition to more compre- 
hensive studies covering all causes. 

For those cases such as the GBD Study, where esti- 
mates are undertaken for a set of mutually exclusive 
and collectively exhaustive cause of death categories, we 
recommend a two-step process. First, produce models 
with the best out-of-sample predictive validity for each 
cause of death independently, taking advantage of all 
available high quality data and capturing spatial-tem- 
poral patterns in the data. Second, predicted cause of 
death estimates for each cause can be modified to sum 
to the total all-cause mortality predictions. This might 
be desirable because there are substantially more data 
available for all- cause mortality, meaning we have more 
confidence in these predictions of total mortality than 
we would in simply combining all the single cause mod- 
els together and taking the sum to be all-cause mortal- 
ity. The most appropriate methods for combining 
estimates of cause-specific mortality in a country-sex- 
age group to equal the all-cause mortality estimates that 
appropriately take into account the variance of different 
predictions by cause is a topic of active research beyond 
the scope of this study. While such an approach is not 
ideal, it outperforms models that make the trade-off of 
using much more restrictive assumptions in order to fit 
within a framework that models all causes simulta- 
neously. Demonstrations of the overall predictive validity 
of the two-step approach applied to selected cases will 
be an important area for future research. 

Our model building strategy implemented in CODEm 
and alternative implementations that will surely come in 
the future all require access to substantial computational 
power. Cheaper cloud computing and processors mean 
that many research groups around the world already 
have access to the required computational resources. 
These groups will likely pursue even more sophisticated 
approaches in the coming years that have more diverse 
model pools and capture uncertainty that stems from 
data processing. But for many users, even the current 
strategies will be beyond their available computational 
power. We believe that it will be important to catalyze 
access to these approaches for users especially in the 
developing world. Through internet-accessed servers 
dedicated to this type of processing, it should be possi- 
ble to find a feasible way to make these tools more 
widely available. 

No doubt, vigorous debate on the level and trend of 
important causes of death such as maternal causes, 
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malaria, and cardiovascular diseases will continue. Such 
debates will continue even in high-income countries 
with complete vital registration and medical certification 
of causes of death because of misclassification biases. 
Cause of death estimation will continue to rely on 
appropriate modeling strategies to better understand 
real, underlying trends in diseases and injuries. Consen- 
sus around a set of principles for cause of death model 
building will be of great and immediate value to the glo- 
bal public health community who rely on this informa- 
tion for monitoring progress with development goals. 
CODEm is one pragmatic implementation of the princi- 
ples laid out in this paper, one that will have extensive 
use in the GBD 2010 Study. More widespread imple- 
mentation of these principles will hopefully continue to 
emerge over the next few years, leading to a much bet- 
ter understanding of levels and trends in key causes of 
death. 

Additional material 



Additional file 1: Descriptions, rankings, and predictive validity 
metrics for maternal mortality component models 
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